Assessment and accountability have played prominent roles in many of the education 
reform efforts during the past 50 years. In the 1950s, under the influence of James B. 
Conant's work on comprehensive high schools, testing was used to select students for 
higher education and to identify students for gifted programs. By the mid-1960s test 
results were used as one measure to evaluate the effectiveness of Title I and other 
federal programs. In the 1970s and early 1980s, the minimum competency testing 
movement spread rapidly; 34 states instituted some sort of testing of basic skills as a 
graduation requirement. Overlapping the minimum competency testing movement and 
continuing into the late 1980s and early 1990s was the expansion of the use of 
standardized test results for accountability purposes. 

Assessment is appealing to policymakers for several reasons: it is relatively inexpensive 
compared to making program changes, it can be externally mandated, it can be 
implemented rapidly, and it offers visible results. This Digest discusses significant 
features of present-day assessment programs and offers recommendations to increase 
positive effects and minimize negative ones. 

WHAT ARE THE CHARACTERISTICS OF CURRENT REFORM EFFORTS? 



Although a number of other important features might be considered in any discussion of 
assessment and education reform (e.g., the emphasis on performance-based 
approaches to assessment, the concept of tests worth teaching to, and the politically 
controversial and technically challenging issue of opportunity to learn), I focus on the 
following three: 



* An emphasis on the development and use of ambitious content standards as the basis 
of assessment and accountability. 



* The dual emphasis on setting demanding performance standards and on the inclusion 
of all students. 



* The attachment of high-stakes accountability mechanisms for schools, teachers, and 
sometimes, students. 

Content standards. The federal government has encouraged states to develop content 
and performance standards that are demanding. Standards-based reform is also a 
central part of many of the state reform efforts, including those in states such as 
Kentucky and Maryland, which have been using standards-based assessments for several 
years, and those in states such as Colorado and Missouri, which have more recently 
introduced standards-based assessment systems. A great deal has been written about the 
strengths and weaknesses of content standards (e.g., Education Week, 1997; Lerner, 
1998; Olson, 1998; Raimi & Braden, 1998). It is worth acknowledging that content 
standards vary a good deal in specificity and in emphasis. Content standards can, and 
should, if they are to be more than window dressing, influence both the choice of 
constructs to be measured and the ways in which they are eventually measured. 



Performance standards. 

Performance standards are supposed to specify how good is good enough. There are at 
least four critical characteristics of performance standards. First, they are intended to be 
absolute rather than normative. Second, they are expected to be set at high, world-class 
levels. Third, a relatively small number of levels (e.g., advanced, proficient) are typically 
identified. Finally, they are expected to apply to all, or essentially all, students, rather 
than a selected subset such as college-bound students seeking advanced placement. 

Should the intent be to aspire not just to high standards for all students, but to the same 
high standards for all students and on the same time schedule for all students (e.g., 
meet reading standards in English at the end of Grade 4)? Coffman (1993) sums up the 
problems of holding common high standards for all students as follows: "Holding 
common standards for all pupils can only encourage a narrowing of educational 
experiences for most pupils, doom many to failure, and limit the development of many 
worthy talents" (p. 8). Although this statement runs counter to the current Zeitgeist and 
may not even be considered politically correct, it seems to me a sensible conclusion 
that is consistent with both evidence and common sense. Having high standards is not 
the same as having common standards for all, especially when they are tied to a lock 
step of age or grade level. 



High-stakes accountability. 

The use of student performance on tests in accountability systems is not new. 
Examples of payment for results, such as the flurry of performance contracting in the 
1960s, have cropped up and faded away over many decades. What is 
somewhat different about the current emphasis on performance-based accountability is 
its pervasiveness. As Elmore, Abelmann, and Fuhrman note, "What is new is an 
increasing emphasis on student performance as the touchstone for state governance" 
(1996, p. 65). Student achievement is being used not only to single out schools that 
require special assistance, but also to provide cash incentives for improvements in 
performance. Yet several fundamental questions remain about the student 
assessments, the accountability model, and the validity, impact, and credibility of the 
system. 

As noted earlier, for example, the choice of constructs matters. Content areas (and 
subareas within those content areas) that are assessed for high-stakes accountability 
receive emphasis while those that are left out languish. Meyer (1996) has argued that 
"in a high-stakes accountability system, teachers and administrators are likely to exploit 
all avenues to improve measured performance. For example, teachers may 'teach 
narrowly to the test.' For tests that are relatively immune to this type of corruption, 
teaching to the test could induce teachers and administrators to adopt new curriculums 
and teaching techniques much more rapidly than they otherwise would" (p. 140). 

It is unclear, however, that there is either the know-how or the will to develop 
assessments that are sufficiently "immune to this type of corruption." It is expensive to 
introduce a new, albeit well-equated, form of a test on each new administration. And if 
ambitious performance-based tasks are added to the mix, still greater increases in costs 
will result. 

A second area of concern regarding high-stakes assessments relates to what data the 
basic model should employ. Some possibilities include current status, comparisons of 
cross-sectional cohorts of students at different grades in the same year, comparisons of 
cross-sectional cohorts in a fixed grade from one year to the next, longitudinal 
comparisons of school aggregate scores without requiring matched individual data, and 
longitudinal comparisons based only on matched student records. Should simple 
change scores be used or some form of regression-based adjustment? And, if 
regression-based adjustments are used, what variables should be included as 
predictors? In particular, should measures of socioeconomic status be used in the 
adjustments? 
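
The difference between simple change scores and a regression-based adjustment can be 
illustrated with a minimal Python sketch. The school scores are hypothetical, the 
function names are illustrative assumptions, and the prior-year score is used as the 
only predictor; an actual accountability model would still have to settle which 
predictors, if any, to include. 

```python
from statistics import mean

def simple_change(prior, current):
    """Raw gain for each school: current score minus prior score."""
    return [c - p for p, c in zip(prior, current)]

def regression_adjusted(prior, current):
    """Residual for each school from a least-squares fit of current on prior.

    A positive residual means the school scored higher than predicted
    from its prior score alone.
    """
    mean_prior, mean_current = mean(prior), mean(current)
    sxx = sum((p - mean_prior) ** 2 for p in prior)
    sxy = sum((p - mean_prior) * (c - mean_current)
              for p, c in zip(prior, current))
    slope = sxy / sxx
    intercept = mean_current - slope * mean_prior
    return [c - (intercept + slope * p) for p, c in zip(prior, current)]

# Hypothetical school mean scores in two successive years.
prior_scores = [40, 50, 60]
current_scores = [45, 52, 70]

print(simple_change(prior_scores, current_scores))        # [5, 2, 10]
print(regression_adjusted(prior_scores, current_scores))
```

The two indexes can rank schools differently: a raw gain ignores where a school 
started, whereas the residual compares each school only with what its starting point 
predicts. 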

Elmore, Abelmann, and Fuhrman (1996) present both sides of this issue, noting that on 
the one hand, schools can fairly be held accountable only for those factors they can 
control, but on the other, controlling for student background or prior achievement 
institutionalizes low expectations for poor, minority, low-achieving students (pp. 93-94). 
Kentucky's interesting approach to this dilemma has been to set a common goal for all 
schools by the end of 20 years, thus establishing faster biennial growth targets for 
initially low-scoring schools than for initially high-scoring schools (Guskey, 1994). 
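
The arithmetic behind such targets can be sketched as follows, assuming (for 
illustration only) that each school's distance to the common goal is divided into 
equal biennial increments. The goal of 100 and the baseline scores are hypothetical 
values, and the function name is an assumption rather than part of any state's system. 

```python
def biennial_growth_target(baseline, goal, horizon_years=20, period_years=2):
    """Points a school must gain per period to reach the goal on schedule.

    Assumes equal increments: the gap between baseline and goal is split
    evenly across the number of periods in the horizon.
    """
    periods = horizon_years // period_years
    return (goal - baseline) / periods

GOAL = 100.0  # hypothetical common goal for all schools

for baseline in (30.0, 60.0):  # hypothetical low- and high-scoring schools
    print(baseline, biennial_growth_target(baseline, GOAL))
# prints:
# 30.0 7.0
# 60.0 4.0
```

Under these assumptions the school starting at 30 must gain 7 points per biennium, 
while the school starting at 60 must gain only 4. 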

The biggest question of all is whether the assessment-based accountability models that 
are now being used or being considered by states and districts have been shown to 
improve education. Unfortunately, it is difficult to get a clear-cut answer to this simple 
question. Certainly, there is evidence that performance on the measures used in 
accountability systems increases over time, but that can also be linked to the use of old 
norms, the repeated use of test forms year after year, the exclusion of students from 
participating in accountability testing programs, and the narrow focusing of instruction 
on the skills and question types used on the test (see Koretz, 1988; Linn et al., 1990; 
Shepard, 1990). Comparative data are needed to evaluate the apparent gains. The 
National Assessment of Educational Progress provides one source of such data. 
Comparisons of state NAEP and state assessment results sometimes suggest similar 
trends; for example, increases in numbers of students scoring at or above basic or 
proficient levels on NAEP may track with improved state test scores over time. In other 
cases, the trends for a state's own assessment and NAEP will suggest contradictory 
conclusions about the changes in student achievement. Divergence of trends does not 
prove that NAEP is right and the state assessment is misleading, but it does raise 
important questions about the generalizability of gains reported on a state's own 
assessment, and hence, about the validity of claims regarding student achievement. 

HOW CAN ASSESSMENTS BE USED MORE WISELY? 



Assessment systems that are useful monitors lose much of their dependability and 
credibility for that purpose when high stakes are attached to them. The unintended 
negative effects of the high-stakes accountability uses often outweigh the intended 
positive effects. It is worth arguing for more modest claims about uses that can validly 
be made of our best assessments and warning against the over-reliance on them that is 
so prevalent and popular. To enhance the validity, credibility, and positive impact of 
assessment and accountability systems while minimizing their negative effects, 
policymakers should: 



1. Provide safeguards against selective exclusion of students from assessments. 



2. Make the case that high-stakes accountability requires new high-quality assessments 
each year that are equated to those of previous years. 



3. Don't put all of the weight on a single test. Instead, seek multiple indicators. The 
choice of construct matters and the use of multiple indicators increases the validity of 
inferences based upon observed gains in achievement. 



4. Place more emphasis on comparisons of performance from year to year than from 
school to school. This allows for differences in starting points while maintaining an 
expectation of improvement for all. 



5. Consider both value added and status in the system. Value added provides schools 
that start out far from the mark a reasonable chance to show improvement while status 
guards against institutionalizing low expectations for those same students and schools. 



6. Recognize, evaluate, and report the degree of uncertainty in the reported results. 



7. Put in place a system for evaluating both the intended positive effects and the more 
likely unintended negative effects of the system. 
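
Recommendation 6 can be made concrete with a minimal Python sketch that reports a 
school mean together with an approximate 95 percent interval. The scores are 
hypothetical, the function name is an assumption, and the interval relies on a normal 
approximation that treats the tested students as a simple random sample; it ignores 
measurement error and other sources of uncertainty. 

```python
from math import sqrt
from statistics import mean, stdev

def mean_with_interval(scores, z=1.96):
    """Return (mean, lower, upper) using a normal-approximation interval.

    The standard error is the sample standard deviation divided by the
    square root of the number of scores.
    """
    n = len(scores)
    m = mean(scores)
    se = stdev(scores) / sqrt(n)
    return m, m - z * se, m + z * se

# Hypothetical scores for a small school.
school_scores = [48, 55, 61, 52, 70, 66, 59, 45]
m, lower, upper = mean_with_interval(school_scores)
print(f"mean = {m:.1f}, approximate 95% interval = ({lower:.1f}, {upper:.1f})")
```

Reporting the interval alongside the mean shows how much of a small year-to-year 
change could be attributed to sampling variation alone, particularly in small schools. 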